99 research outputs found
Rethinking Implicit Neural Representations for Vision Learners
Implicit Neural Representations (INRs) are a powerful way to parameterize
continuous signals in computer vision. However, almost all INR methods are
limited to low-level tasks, e.g., image/video compression, super-resolution,
and image generation. How to extend INRs to high-level tasks and deep
networks remains under-explored. Existing INR methods suffer from two
problems: 1) narrow theoretical definitions of INRs that are inapplicable to
high-level tasks; 2) a lack of representational capability for deep networks.
Motivated by the above facts, we reformulate the definitions of INRs from a
novel perspective and propose an innovative Implicit Neural Representation
Network (INRN), which is the first INR framework to tackle both low-level and
high-level tasks. Specifically, we present three key designs for basic blocks
in INRN along with two different stacking ways and corresponding loss
functions. Extensive experiments with analysis on both low-level tasks (image
fitting) and high-level vision tasks (image classification, object detection,
instance segmentation) demonstrate the effectiveness of the proposed method.
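The core idea the abstract builds on, representing a signal as a continuous function of coordinates, can be sketched minimally. The following is a generic coordinate-based fit (a hypothetical sine-feature model trained by gradient descent), not the paper's INRN architecture:

```python
import math

# Hypothetical minimal INR-style fit: represent a 1-D signal as a
# continuous function f(x) = sum_k a_k * sin(k * x), trained by
# stochastic gradient descent on coarsely sampled points.

def features(x, K=8):
    return [math.sin(k * x) for k in range(1, K + 1)]

def predict(coeffs, x):
    return sum(a * f for a, f in zip(coeffs, features(x, len(coeffs))))

def fit(xs, ys, K=8, lr=0.05, steps=2000):
    coeffs = [0.0] * K
    for _ in range(steps):
        for x, y in zip(xs, ys):
            feats = features(x, K)
            err = sum(a * f for a, f in zip(coeffs, feats)) - y
            for k in range(K):
                coeffs[k] -= lr * err * feats[k]
    return coeffs

# Target: a band-limited signal, sampled at only 32 grid points.
target = lambda x: math.sin(2 * x) + 0.5 * math.sin(5 * x)
xs = [i * 2 * math.pi / 32 for i in range(32)]
coeffs = fit(xs, [target(x) for x in xs])

# The learned function can be queried at ANY coordinate, i.e. the
# stored representation is continuous, not tied to the sample grid.
residual = abs(predict(coeffs, 0.123) - target(0.123))
print(residual)  # small residual even off the training grid
```

Querying at arbitrary coordinates is what makes such representations natural for tasks like super-resolution, one of the low-level applications the abstract lists.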
Diverse Target and Contribution Scheduling for Domain Generalization
Generalization under distribution shift has been a great challenge in
computer vision. The prevailing practice of directly employing one-hot
labels as the training targets in domain generalization (DG) can lead to
gradient conflicts, making it hard to capture the intrinsic class
characteristics and to increase the intra-class variation. Besides,
existing methods in DG mostly overlook the distinct contributions of source
(seen) domains, resulting in uneven learning from these domains. To address
these issues, we first present a theoretical and empirical analysis of the
existence of gradient conflicts in DG, unveiling the previously unexplored
relationship between distribution shifts and gradient conflicts during the
optimization process. In this paper, we present a novel perspective on DG
based on the empirical source-domain risk and propose a new paradigm called
Diverse Target and Contribution Scheduling (DTCS). DTCS comprises two
innovative modules: Diverse Target Supervision (DTS) and Diverse Contribution
Balance (DCB), with the aim of addressing the limitations associated with the
common utilization of one-hot labels and equal contributions for source domains
in DG. Specifically, DTS employs distinct soft labels as training targets to
account for various feature distributions across domains and thereby mitigates
the gradient conflicts, and DCB dynamically balances the contributions of
source domains by ensuring a fair decline in losses of different source
domains. Extensive experiments with analysis on four benchmark datasets show
that the proposed method achieves a competitive performance in comparison with
the state-of-the-art approaches, demonstrating the effectiveness and advantages
of the proposed DTCS.
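The "fair decline" idea behind DCB can be illustrated with a simple re-weighting rule: up-weight source domains whose loss has declined less than average, so all domains' losses fall at comparable rates. The exact weighting in DTCS may differ; the function below is a hypothetical sketch:

```python
# Hypothetical DCB-style balancing: weight each source domain's loss
# by its relative remaining loss, so lagging domains get more weight.

def contribution_weights(initial_losses, current_losses):
    # Relative remaining loss per domain: 1.0 means no progress yet.
    ratios = [cur / init for cur, init in zip(current_losses, initial_losses)]
    mean_ratio = sum(ratios) / len(ratios)
    # Domains lagging behind the mean decline get raw weight > 1.
    raw = [r / mean_ratio for r in ratios]
    total = sum(raw)
    # Normalize so the weights sum to the number of domains.
    return [len(raw) * w / total for w in raw]

init = [1.0, 1.0, 1.0]   # per-domain losses at the start of training
cur  = [0.2, 0.4, 0.6]   # the third domain is declining slowest
w = contribution_weights(init, cur)
print(w)  # the slowest-declining domain receives the largest weight
```

The total training loss would then be the weighted sum of per-domain losses, recomputed as training progresses.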
Uncertainty-Aware Consistency Regularization for Cross-Domain Semantic Segmentation
Unsupervised domain adaptation (UDA) aims to adapt existing models of the
source domain to a new target domain with only unlabeled data. Many
adversarial-based UDA methods involve highly unstable training and have to
carefully tune the optimization procedure. Some non-adversarial UDA methods
employ a consistency regularization on the target predictions of a student
model and a teacher model under different perturbations, where the teacher
shares the same architecture with the student and is updated by the exponential
moving average of the student. However, these methods suffer from noticeable
negative transfer resulting from either the error-prone discriminator network
or the unreliable teacher model. In this paper, we propose an
uncertainty-aware consistency regularization method for cross-domain semantic
segmentation. By exploiting the latent uncertainty information of the target
samples, more meaningful and reliable knowledge from the teacher model can be
transferred to the student model. In addition, we further reveal the reason why
the current consistency regularization is often unstable in minimizing the
distribution discrepancy. We also show that our method can effectively ease
this issue by mining the most reliable and meaningful samples with a dynamic
weighting scheme of consistency loss. Experiments demonstrate that the proposed
method outperforms the state-of-the-art methods on two domain adaptation
benchmarks, GTAV-to-Cityscapes and SYNTHIA-to-Cityscapes.
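One plausible way to realize uncertainty-aware consistency, weighting the student-teacher consistency term by the teacher's predictive certainty and keeping the teacher as an EMA of the student, can be sketched as follows. This is illustrative only, not the paper's exact formulation:

```python
import math

# Illustrative sketch: per-sample consistency weighted by teacher
# certainty (1 - normalized predictive entropy), plus the standard
# exponential-moving-average teacher update.

def entropy(p):
    return -sum(pi * math.log(pi + 1e-12) for pi in p)

def certainty(p):
    return 1.0 - entropy(p) / math.log(len(p))  # in [0, 1]

def consistency_loss(student_probs, teacher_probs):
    total, weight_sum = 0.0, 0.0
    for s, t in zip(student_probs, teacher_probs):
        w = certainty(t)  # low-entropy teacher prediction -> high weight
        mse = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(s)
        total += w * mse
        weight_sum += w
    return total / max(weight_sum, 1e-12)

def ema_update(teacher_w, student_w, alpha=0.99):
    # Teacher weights track a slow moving average of the student's.
    return [alpha * tw + (1 - alpha) * sw for tw, sw in zip(teacher_w, student_w)]

teacher = [[0.9, 0.05, 0.05],   # confident prediction on a target sample
           [0.34, 0.33, 0.33]]  # highly uncertain prediction
student = [[0.7, 0.2, 0.1],
           [0.6, 0.2, 0.2]]
loss = consistency_loss(student, teacher)
print(loss)
```

The uncertain second sample contributes almost nothing to the loss, which is the intended mining of reliable samples.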
Context-Aware Mixup for Domain Adaptive Semantic Segmentation
Unsupervised domain adaptation (UDA) aims to adapt a model of the labeled
source domain to an unlabeled target domain. Existing UDA-based semantic
segmentation approaches always reduce the domain shifts in pixel level, feature
level, and output level. However, almost all of them largely neglect the
contextual dependency, which is generally shared across different domains,
leading to less-desired performance. In this paper, we propose a novel
Context-Aware Mixup (CAMix) framework for domain adaptive semantic
segmentation, which exploits this important clue of context-dependency as
explicit prior knowledge in a fully end-to-end trainable manner for enhancing
the adaptability toward the target domain. Firstly, we present a contextual
mask generation strategy by leveraging the accumulated spatial distributions
and prior contextual relationships. The generated contextual mask is critical
in this work and will guide the context-aware domain mixup on three different
levels. Besides, given the context knowledge, we introduce a
significance-reweighted consistency loss to penalize the inconsistency between
the mixed student prediction and the mixed teacher prediction, which alleviates
the negative transfer of the adaptation, e.g., early performance degradation.
Extensive experiments and analysis demonstrate the effectiveness of our method
against the state-of-the-art approaches on widely-used UDA benchmarks.
Comment: Accepted to IEEE Transactions on Circuits and Systems for Video
Technology (TCSVT).
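The mask-guided mixing step that the contextual mask drives can be sketched in a few lines. Mask generation itself (the paper's contribution) is omitted; here the binary mask is simply given, and the same mask would also mix the corresponding (pseudo-)labels:

```python
# Minimal sketch of mask-guided domain mixup: where the binary mask is
# 1, take the source pixel; elsewhere, keep the target pixel. In CAMix
# the mask comes from accumulated spatial priors and contextual
# relationships; here it is hand-specified for illustration.

def mixup(mask, source, target):
    return [[s if m else t for m, s, t in zip(mr, sr, tr)]
            for mr, sr, tr in zip(mask, source, target)]

mask   = [[1, 1, 0],
          [0, 1, 0]]
src_im = [[10, 10, 10],
          [10, 10, 10]]
tgt_im = [[99, 99, 99],
          [99, 99, 99]]
mixed = mixup(mask, src_im, tgt_im)
print(mixed)  # [[10, 10, 99], [99, 10, 99]]
```

Applying the identical mask to labels keeps pixels and supervision aligned in the mixed sample.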
DMT: Dynamic Mutual Training for Semi-Supervised Learning
Recent semi-supervised learning methods use pseudo supervision as their core
idea, especially self-training methods that generate pseudo labels. However,
pseudo labels are unreliable. Self-training methods usually rely on a single
model's prediction confidence to filter out low-confidence pseudo labels,
thus retaining high-confidence errors and wasting many low-confidence
correct labels. In this
paper, we point out that it is difficult for a model to counter its own errors.
Instead, leveraging the disagreement between different models is key to
locating pseudo-label errors. With this new viewpoint, we propose mutual
training between two different models by a dynamically re-weighted loss
function, called Dynamic Mutual Training (DMT). We quantify inter-model
disagreement by comparing predictions from two different models to dynamically
re-weight loss in training, where a larger disagreement indicates a possible
error and corresponds to a lower loss value. Extensive experiments show that
DMT achieves state-of-the-art performance in both image classification and
semantic segmentation. Our codes are released at
https://github.com/voldemortX/DST-CBC
Comment: Reformatted.
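A minimal sketch of the disagreement-based re-weighting described above: model B's confidence in model A's pseudo label, raised to an exponent gamma (a hypothetical value here), scales the loss, so strong inter-model disagreement drives the sample's loss toward zero:

```python
import math

# Sketch of DMT-style dynamic re-weighting (gamma is illustrative):
# model A provides the pseudo label, and model B's probability for
# that label scales the cross-entropy on the student's prediction.

def dmt_weight(probs_b, pseudo_label, gamma=3.0):
    return probs_b[pseudo_label] ** gamma

def weighted_ce(probs_student, probs_b, pseudo_label, gamma=3.0):
    w = dmt_weight(probs_b, pseudo_label, gamma)
    return -w * math.log(probs_student[pseudo_label] + 1e-12)

# Model A predicts class 0; model B agrees on one sample and
# disagrees on the other.
agree    = weighted_ce([0.6, 0.3, 0.1], [0.8, 0.1, 0.1], 0)
disagree = weighted_ce([0.6, 0.3, 0.1], [0.1, 0.8, 0.1], 0)
print(agree > disagree)  # disagreement yields a much smaller loss
```

Down-weighting rather than hard-filtering keeps low-confidence but correct labels in play, which is the failure mode of single-model confidence filtering that the abstract criticizes.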
TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers
Detection Transformer (DETR) and Deformable DETR have been proposed to
eliminate the need for many hand-designed components in object detection while
demonstrating performance on par with previous complex hand-crafted detectors.
However, their performance on Video Object Detection (VOD) has not been well
explored. In this paper, we present TransVOD, the first end-to-end video object
detection system based on spatial-temporal Transformer architectures. The first
goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g.,
optical flow models and relation networks. Besides, benefiting from the object
query design in DETR, our method does not need complicated post-processing methods
such as Seq-NMS. In particular, we present a temporal Transformer to aggregate
both the spatial object queries and the feature memories of each frame. Our
temporal transformer consists of two components: Temporal Query Encoder (TQE)
to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to
obtain current frame detection results. These designs boost the strong baseline
deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID
dataset. Then, we present two improved versions of TransVOD including
TransVOD++ and TransVOD Lite. The former fuses object-level information into
the object queries via dynamic convolution, while the latter models an entire
video clip as a single output to speed up inference. We give a detailed
analysis of all three models in the experiment section. In particular, our proposed
TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet
VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and
accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single
V100 GPU device.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine
Intelligence (IEEE TPAMI), extended version of arXiv:2105.1092
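The basic operation behind a temporal query encoder, one object query attending over the corresponding queries of reference frames, can be sketched with plain scaled dot-product attention. Dimensions and values below are illustrative, not the actual TransVOD implementation:

```python
import math

# Illustrative scaled dot-product attention: a current-frame object
# query aggregates the matching query slot across reference frames.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# One query from the current frame attends over three reference frames.
current = [1.0, 0.0]
frame_queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
fused = attend(current, frame_queries, frame_queries)
print(fused)  # pulled toward the temporally consistent frames
```

Stacking such attention over both queries and feature memories, as the abstract describes, is what lets detection in one frame borrow evidence from the rest of the clip.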
Inclusive wealth index measuring sustainable development potentials for Chinese cities
The UN Sustainable Development Goals (SDGs) are the blueprint for achieving a better and more sustainable future. To achieve the goals, tracking progress, not just at the national level but locally, is crucial to guide future policy development. While sustainability assessment at the national level is quite advanced in China, similar assessments at the regional or even the city level are currently lacking. Here, we advance the Inclusive Wealth Index (IWI) framework, first proposed by the United Nations Development Programme, by taking water wealth into account and adjusting the variables based on data availability. We then investigate the sustainability performance of 210 cities in China in 2016 via this extended IWI framework. The analysis makes a holistic assessment based on produced, human, and natural capital, while accounting for heterogeneities in economic, social, and environmental conditions across these cities. We find that cities clustered in the eastern parts of China are characterized by high levels of sustainability performance and increasing capacities for sustainability, largely driven by the high quality and quantity of their human capital. In comparison, the western cities have a large amount of low-skilled human capital and low levels of produced capital, which determines their low sustainability performance. Cities clustered in the north are heavily dependent on low value-added products and resource-intensive industries. Furthermore, we make projections of the IWI and its three components for different cities from 2020 to 2030, referring to the index systems presented in city plans, which describe the development speed of income, education, fixed-asset investment, forests, etc.
In the future, cities in the central and western clusters show considerable potential for increasing IWI per capita, whereas cities with a dominant energy sector in the north would face declining capacity for sustainability due to the exhaustion of fossil fuels and raw materials. By fully taking account of and adapting to local circumstances, we design tailored pathways for different types of cities to grow their sustainability potential. Resource-dependent cities in the north could avoid the impending decline by gradually developing their human and produced capital while abandoning their resource dependency. Our study contributes to city-level sustainable development in China through the lens of per capita IWI and the potential future dynamics of the changing composition of these cities' capital.
Flames: Benchmarking Value Alignment of Chinese Large Language Models
The widespread adoption of large language models (LLMs) across various
regions underscores the urgent need to evaluate their alignment with human
values. Current benchmarks, however, fall short of effectively uncovering
safety vulnerabilities in LLMs. Despite numerous models achieving high scores
and 'topping the chart' in these evaluations, there is still a significant gap
in LLMs' deeper alignment with human values and achieving genuine harmlessness.
To this end, this paper proposes the first highly adversarial benchmark named
Flames, consisting of 2,251 manually crafted prompts, ~18.7K model responses
with fine-grained annotations, and a specified scorer. Our framework
encompasses both common harmlessness principles, such as fairness, safety,
legality, and data protection, and a unique morality dimension that integrates
specific Chinese values such as harmony. Based on the framework, we carefully
design adversarial prompts that incorporate complex scenarios and jailbreaking
methods, mostly with implicit malice. By prompting mainstream LLMs with such
adversarially constructed prompts, we obtain model responses, which are then
rigorously annotated for evaluation. Our findings indicate that all the
evaluated LLMs demonstrate relatively poor performance on Flames, particularly
in the safety and fairness dimensions. Claude emerges as the best-performing
model overall, but its harmless rate is only 63.08%, while GPT-4 scores only
39.04%. The complexity of Flames far exceeds that of existing benchmarks,
setting a new challenge for contemporary LLMs and highlighting the need for
further alignment of LLMs. To efficiently evaluate new models on the benchmark,
we develop a specified scorer capable of scoring LLMs across multiple
dimensions, achieving an accuracy of 77.4%. The Flames Benchmark is publicly
available at https://github.com/AIFlames/Flames